Abstract

In the present paper we consider the problem of matrix completion with noise for general sampling schemes. Unlike previous works, in our construction we do not need to know or to evaluate the sampling distribution or the variance of the noise. We propose new nuclear-norm penalized estimators, one of them of the "square-root" type. We prove that, up to a logarithmic factor, our estimators achieve optimal rates with respect to the estimation error.



1 Introduction

This paper considers the problem of matrix recovery from a small set of noisy observations. Suppose that we observe a small set of entries of a matrix. The problem of inferring the many missing entries from this set of observations is the matrix completion problem. A usual assumption that makes such a completion possible is that the unknown matrix has low rank or approximately low rank.

The problem of matrix completion comes up in many areas including collaborative filtering, multi-class learning in data analysis, system identification in control, global positioning from partial distance information and computer vision, to mention some of them. For instance, in computer vision, this problem arises as many pixels may be missing in digital images. In collaborative filtering, one wants to make automatic predictions about the preferences of a user by collecting information from many users. So, we have a data matrix where rows are users and columns are items. For each user, we have a partial list of their preferences. We would like to predict the missing ratings in order to be able to recommend items that may interest each user.

The noiseless setting was first studied by Candes and Recht [6] using nuclear 
norm minimization. A tighter analysis of the same convex relaxation was carried 
out in [7]. For a simpler approach see [21] and [10]. An alternative line of 
work was developed by Keshavan et al in [12]. A more common situation in 
applications corresponds to the noisy setting in which the few available entries 
are corrupted by noise. This problem has been extensively studied recently. 



The most popular methods rely on nuclear norm minimization (see, e.g., [5, 11, 22, 20, 17, 18, 9, 13]). One can also use rank penalization, as was done by Bunea et al. [4] and Klopp [14]. Typically, in the matrix completion problem, the sampling scheme is supposed to be uniform. However, in practice, the observed entries are not guaranteed to follow the uniform scheme and their distribution is not known exactly.

In the present paper we consider nuclear norm penalized estimators and study the corresponding estimation error in Frobenius norm. We consider both the case when the variance of the noise is known and the case when it is not. In our construction we do not need to know or to estimate the sampling distribution. Our results are valid for more general sampling schemes than the uniform one: we only assume that the sampling distribution satisfies some mild "regularity" conditions (see Assumptions 1 and 2).

Let $A_0 \in \mathbb{R}^{m_1 \times m_2}$ be the unknown matrix. Our main results, Theorems 7 and 10, show the following bound on the normalized Frobenius error of the estimators $\hat A$ that we propose in this paper: with high probability

$$\frac{\|\hat A - A_0\|_2^2}{m_1 m_2} \lesssim \frac{\log(m_1 + m_2)\,\max(m_1, m_2)\,\mathrm{rank}(A_0)}{n},$$

where the symbol $\lesssim$ means that the inequality holds up to a multiplicative numerical constant. This guarantees that the prediction error of our estimator is small whenever $n \gtrsim \log(m_1 + m_2)\max(m_1, m_2)\,\mathrm{rank}(A_0)$. This quantifies the sample size necessary for successful matrix completion. Note that, when $\mathrm{rank}(A_0)$ is small, this is considerably smaller than $m_1 m_2$, the total number of entries. For large $m_1, m_2$ and small $r$, this is also quite close to the number of degrees of freedom of a rank $r$ matrix, which is $(m_1 + m_2)r - r^2$.

An important feature of our estimator is that its construction requires only an upper bound on the maximum absolute value of the entries of $A_0$. This condition is very mild. A bound on the maximum of the elements is often known in applications. For instance, if the entries of $A_0$ are some users' ratings, it corresponds to the maximal rating. Previously, the estimators proposed by Koltchinskii et al. [18] and by Klopp [14] also required a bound on the maximum of the elements of the unknown matrix, but their constructions use uniform sampling and additionally require the knowledge of an upper bound on the variance of the noise. Other works on matrix completion require more involved conditions on the unknown matrix. For more details see Section 3.

More general sampling schemes were previously considered in [19, 20, 8]. In [19], Lounici considers a different estimator and measures the prediction error in the spectral norm. In [20, 8] the authors consider penalization using the weighted trace-norm, which was first introduced by Srebro et al. [23]. For this construction one needs to know the actual sampling distribution or to estimate the empirical frequencies. The weighted trace-norm, used in [20, 8], corrects a specific situation where the standard trace-norm fails. This situation corresponds to a non-uniform distribution where the row/column marginal distribution is such that some columns or rows are sampled with very high probability (for a more thorough discussion see [23, 8]). Unlike [20, 8], we use the standard trace-norm penalization, and our assumption on the sampling distribution (Assumption 1) guarantees that no row or column is sampled with very high probability.

Most of the existing methods of matrix completion rely on the knowledge or 
a pre-estimation of the standard deviation of the noise. The matrix completion 
problem with unknown variance of the noise was previously considered in [13] 
using a different estimator which requires uniform sampling. Note also that 
in [13] the bound on the prediction error is obtained under some additional 
condition on the rank and the "spikiness ratio" of the matrix. The construction 
of the present paper is valid for more general sampling distributions and does 
not require such an extra condition. 

The remainder of this paper is organized as follows. In Section 2 we intro- 
duce our model and the assumptions on the sampling scheme. For the reader's 
convenience, we also collect notation which we use throughout the paper. In 
Section 3 we consider matrix completion in the case of known variance of the 
noise. We define our estimator and prove Theorem 3, which gives a general bound on its Frobenius error conditionally on bounds for the stochastic terms. Theorem 7 provides bounds on the Frobenius error of our estimator in closed form. To this end we use bounds on the stochastic terms that we derive in Section 5. To obtain such bounds, we use a non-commutative extension of the classical Bernstein inequality. Such inequalities were first obtained in the pioneering work of Ahlswede et al. [1] and Tropp [24]. We use an extension of these ideas to the case of sub-exponential tails due to Koltchinskii [15].

In Section 4 we consider the case when the variance of the noise is unknown. Our construction uses the idea of "square-root" estimators, first introduced by Belloni et al. [2] in the case of the square-root Lasso estimator. Theorem 10 shows that our estimator has the same performance as previously considered estimators which require the knowledge of the standard deviation of the noise and of the sampling distribution.

2 Preliminaries 

2.1 Model and sampling scheme 

Let $A_0 \in \mathbb{R}^{m_1 \times m_2}$ be an unknown matrix, and consider the observations $(X_i, Y_i)$ satisfying the trace regression model

$$Y_i = \mathrm{tr}(X_i^T A_0) + \sigma \xi_i, \quad i = 1, \dots, n. \qquad (1)$$

The noise variables $\xi_i$ are independent, with $\mathbb{E}(\xi_i) = 0$ and $\mathbb{E}(\xi_i^2) = 1$; $X_i$ are random matrices of dimension $m_1 \times m_2$ and $\mathrm{tr}(A)$ denotes the trace of the matrix $A$. Assume that the design matrices $X_i$ are i.i.d. copies of a random matrix $X$ having distribution $\Pi$ on the set

$$\mathcal{X} = \left\{ e_j(m_1)\, e_k^T(m_2),\ 1 \le j \le m_1,\ 1 \le k \le m_2 \right\}, \qquad (2)$$

where $e_l(m)$ are the canonical basis vectors in $\mathbb{R}^m$. Then, the problem of estimating $A_0$ coincides with the problem of matrix completion with random sampling distribution $\Pi$.
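As an illustration of the model (1)-(2), the following minimal sketch draws observations from a given matrix. The function name `sample_observations`, the encoding of each $X_i$ by its (row, column) index pair and the use of Gaussian noise are assumptions made for this example only.

```python
import numpy as np

def sample_observations(A0, n, sigma, pi=None, rng=None):
    """Draw n noisy entries of A0 following the trace regression model (1).

    pi is an m1 x m2 array of sampling probabilities (uniform if None).
    Returns row indices, column indices and the responses Y_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    m1, m2 = A0.shape
    if pi is None:
        pi = np.full((m1, m2), 1.0 / (m1 * m2))
    # X_i = e_j e_k^T is encoded by the flat index of the sampled entry (j, k).
    flat = rng.choice(m1 * m2, size=n, p=pi.ravel())
    rows, cols = np.unravel_index(flat, (m1, m2))
    # Assumption for this sketch: standard Gaussian noise (mean 0, variance 1).
    xi = rng.standard_normal(n)
    # tr(X_i^T A0) is simply the sampled entry of A0.
    Y = A0[rows, cols] + sigma * xi
    return rows, cols, Y
```

With `sigma = 0` the responses are exactly the sampled entries, which recovers the noiseless setting.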

One of the particular settings of this problem is the Uniform Sampling at Random (USR) matrix completion, which corresponds to the uniform distribution $\Pi$. We consider a more general weighted sampling model. More precisely, let $\pi_{jk} = \mathbb{P}\left(X = e_j(m_1) e_k^T(m_2)\right)$ be the probability to observe the $(j,k)$-th entry. Let us denote by $C_k = \sum_{j=1}^{m_1} \pi_{jk}$ the probability to observe an element from the $k$-th column and by $R_j = \sum_{k=1}^{m_2} \pi_{jk}$ the probability to observe an element from the $j$-th row. Observe that $\max_{i,j}(C_i, R_j) \ge 1/\min(m_1, m_2)$.

As it was shown in [23], the trace-norm penalization fails in the specific 
situation when the row/column marginal distribution is such that some columns 
or rows are sampled with very high probability (for more details see [23, 8]). 
To avoid such a situation we need the following assumption on the sampling 
distribution: 

Assumption 1. There exists a positive constant $L \ge 1$ such that
$$\max_{i,j}(C_i, R_j) \le L/\min(m_1, m_2).$$

In order to get bounds in the Frobenius norm, we suppose that each element 
is sampled with positive probability: 

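Assumption 2. There exists a positive constant $\mu \ge 1$ such that

$$\pi_{jk} \ge (\mu\, m_1 m_2)^{-1}.$$

In the case of uniform distribution $L = \mu = 1$. Let us set $\|A\|_{L_2(\Pi)}^2 = \mathbb{E}\left(\langle A, X\rangle^2\right)$. Assumption 2 implies that

$$\|A\|_{L_2(\Pi)}^2 \ge (m_1 m_2\, \mu)^{-1} \|A\|_2^2. \qquad (3)$$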
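The quantities entering Assumptions 1 and 2 are easy to compute for a given sampling distribution. The sketch below is illustrative only: the helper names `sampling_constants` and `l2_pi_norm_sq` are hypothetical, and the code returns the smallest constants $L$ and $\mu$ for which the two assumptions hold (it assumes every entry of `pi` is strictly positive).

```python
import numpy as np

def sampling_constants(pi):
    """Return (C, R, L, mu) for an m1 x m2 matrix pi of sampling probabilities."""
    m1, m2 = pi.shape
    C = pi.sum(axis=0)                          # column marginals C_k
    R = pi.sum(axis=1)                          # row marginals R_j
    L = max(C.max(), R.max()) * min(m1, m2)     # smallest L in Assumption 1
    mu = 1.0 / (pi.min() * m1 * m2)             # smallest mu in Assumption 2
    return C, R, L, mu

def l2_pi_norm_sq(A, pi):
    """||A||_{L2(Pi)}^2 = E <A, X>^2 = sum_{j,k} pi_jk * A_jk^2."""
    return float(np.sum(pi * A ** 2))
```

For the uniform distribution both constants equal 1, and for any `pi` the value `l2_pi_norm_sq(A, pi)` is at least `np.sum(A**2) / (m1 * m2 * mu)`, which is inequality (3).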

2.2 Notation 

We provide a brief summary of the notation used throughout this paper. Let $A, B$ be matrices in $\mathbb{R}^{m_1 \times m_2}$.

• We define the scalar product $\langle A, B\rangle = \mathrm{tr}(A^T B)$.

• For $0 < q \le \infty$ the Schatten-q (quasi-)norm of the matrix $A$ is defined by

$$\|A\|_q = \left( \sum_{j=1}^{\min(m_1, m_2)} \sigma_j(A)^q \right)^{1/q} \ \text{for } 0 < q < \infty \quad \text{and} \quad \|A\| = \sigma_1(A),$$

where $(\sigma_j(A))_j$ are the singular values of $A$ ordered decreasingly.

• $\|A\|_\infty = \max_{i,j} |a_{ij}|$ where $A = (a_{ij})$.

• $P_S$ is the projector on the linear vector subspace $S$.

• $S^\perp$ is the orthogonal complement of $S$.

• Let $u_j(A)$ and $v_j(A)$ denote respectively the left and right orthonormal singular vectors of $A$; $S_1(A)$ is the linear span of $\{u_j(A)\}$ and $S_2(A)$ is the linear span of $\{v_j(A)\}$.

• Let $P_A^\perp(B) = P_{S_1^\perp(A)}\, B\, P_{S_2^\perp(A)}$ and $P_A(B) = B - P_A^\perp(B)$.

• Let $\pi_{ij} = \mathbb{P}\left(X = e_i(m_1) e_j^T(m_2)\right)$ be the probability to observe the $(i,j)$-th element.

• For $j = 1, \dots, m_2$, $C_j = \sum_{i=1}^{m_1} \pi_{ij}$ and for $i = 1, \dots, m_1$, $R_i = \sum_{j=1}^{m_2} \pi_{ij}$.

• $R = \mathrm{diag}(R_1, \dots, R_{m_1})$ and $C = \mathrm{diag}(C_1, \dots, C_{m_2})$.

• Let $M = \max(m_1, m_2)$, $m = \min(m_1, m_2)$ and $d = m_1 + m_2$.

• $\|A\|_{L_2(\Pi)}^2 = \mathbb{E}\left(\langle A, X\rangle^2\right)$.

• Let $\{\epsilon_i\}_{i=1}^n$ be an i.i.d. Rademacher sequence and define

$$\Sigma_R = \frac{1}{n}\sum_{i=1}^n \epsilon_i X_i \quad \text{and} \quad \Sigma = \frac{\sigma}{n}\sum_{i=1}^n \xi_i X_i. \qquad (4)$$

• Define the observation operator $\mathcal{O} : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ as $(\mathcal{O}(A))_i = \langle X_i, A\rangle$.

• $Q(A) = \sqrt{\dfrac{1}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, A\rangle\right)^2}$.
3 Matrix completion with known variance of the 
noise 

In this section we consider the matrix completion problem when the variance of the noise is known. We define the following estimator of $A_0$:

$$\hat A = \arg\min_{\|A\|_\infty \le a} \left\{ \frac{1}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, A\rangle\right)^2 + \lambda \|A\|_1 \right\}, \qquad (5)$$

where $\lambda > 0$ is a regularization parameter and $a$ is an upper bound on $\|A_0\|_\infty$. The following theorem gives a general upper bound on the prediction error of the estimator $\hat A$. Its proof is given in Appendix A. The stochastic terms $\|\Sigma\|$ and $\|\Sigma_R\|$ play a key role in what follows.
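The criterion in (5) can be evaluated directly, and the nuclear-norm penalty admits a well-known proximal operator (soft-thresholding of singular values) that is the basic building block of many first-order solvers. The sketch below shows only these two ingredients; the function names are hypothetical, and it does not handle the constraint $\|A\|_\infty \le a$, so it should not be read as a complete solver for (5).

```python
import numpy as np

def penalized_objective(A, rows, cols, Y, lam):
    """Criterion minimized in (5): mean squared residual plus lam * nuclear norm."""
    resid = Y - A[rows, cols]
    return float(np.mean(resid ** 2) + lam * np.linalg.norm(A, ord="nuc"))

def soft_threshold_singular_values(B, tau):
    """Proximal operator of tau * (nuclear norm): shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Observations are passed as index arrays `rows`, `cols` and responses `Y`, so that $\langle X_i, A\rangle$ is read off as `A[rows[i], cols[i]]`.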



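Theorem 3. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2, and let $\lambda \ge 3\|\Sigma\|$. Assume that $\|A_0\|_\infty \le a$ for some constant $a$. Then, there exist numerical constants $(c_1, c_2)$ such that

$$\frac{\|\hat A - A_0\|_2^2}{m_1 m_2} \le \max\left\{ c_1\, \mu^2\, m_1 m_2\, \mathrm{rank}(A_0) \left( \lambda^2 + a^2 \left(\mathbb{E}(\|\Sigma_R\|)\right)^2 \right),\ c_2\, a^2 \mu \sqrt{\frac{\log(d)}{n}} \right\}$$

with probability at least $1 - \dfrac{2}{d}$, where $d = m_1 + m_2$.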



In order to get a bound in closed form we need to obtain suitable upper bounds on $\mathbb{E}(\|\Sigma_R\|)$ and, with probability close to 1, on $\|\Sigma\|$. We will obtain such bounds in the case of sub-exponential noise, i.e. under the following assumption:

Assumption 4.

$$\max_{i=1,\dots,n} \mathbb{E}\exp\left(|\xi_i|/K\right) < \infty.$$

Let $K > 0$ be a constant such that $\max_{i=1,\dots,n} \mathbb{E}\exp\left(|\xi_i|/K\right) \le e$. The following two lemmas give bounds on $\|\Sigma\|$ and $\mathbb{E}(\|\Sigma_R\|)$. We prove them in Section 5 using the non-commutative Bernstein inequality.

Lemma 5. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2. Assume that $(\xi_i)_{i=1}^n$ are independent with $\mathbb{E}(\xi_i) = 0$, $\mathbb{E}(\xi_i^2) = 1$ and satisfy Assumption 4. Then, there exists an absolute constant $C^* > 0$ that depends only on $K$ and such that, for all $t > 0$, with probability at least $1 - e^{-t}$ we have

$$\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| \le C^* \max\left\{ \sqrt{\frac{L\,(t + \log(d))}{mn}},\ \frac{\log(m)\,(t + \log(d))}{n} \right\} \qquad (6)$$

where $d = m_1 + m_2$.



Lemma 6. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2. Assume that $(\xi_i)_{i=1}^n$ are independent with $\mathbb{E}(\xi_i) = 0$, $\mathbb{E}(\xi_i^2) = 1$ and satisfy Assumption 4. Then, for $n \ge m\log^3(d)/L$, there exists an absolute constant $C^* > 0$ such that

$$\mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| \le C^* \sqrt{\frac{2eL\log(d)}{mn}}$$

where $d = m_1 + m_2$.



An optimal choice of the parameter $t$ in these lemmas is $t = \log(d)$. A larger $t$ leads to a slower rate of convergence, and a smaller $t$ does not improve the rate but makes the concentration probability smaller. With this choice of $t$ the second term in the maximum in (6) is negligible for $n > n^*$ where $n^* = 2\log^2(d)\,m/L$. Then, we can choose

$$\lambda = 3C^*\sigma\sqrt{\frac{2L\log(d)}{mn}}, \qquad (7)$$

where $C^*$ is an absolute numerical constant which depends only on $K$. If $\xi_i$ are $N(0,1)$, then we can take $C^* = 6.5$ (see Lemma 4 in [13]). With this choice of $\lambda$ we obtain the following theorem.
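To give a sense of the magnitudes involved, the choice (7) can be evaluated numerically. This is a small sketch with hypothetical function and argument names; the default `C_star = 6.5` is the value quoted above for standard Gaussian noise, and `L = 1.0` corresponds to uniform sampling.

```python
import math

def lambda_known_sigma(sigma, m1, m2, n, L=1.0, C_star=6.5):
    """Regularization parameter (7): 3 * C_star * sigma * sqrt(2 L log(d) / (m n))."""
    d = m1 + m2
    m = min(m1, m2)
    return 3.0 * C_star * sigma * math.sqrt(2.0 * L * math.log(d) / (m * n))
```

The value scales linearly in `sigma` and decreases like $1/\sqrt{n}$, which is what drives the $\log(d)/n$ rate in the theorem that follows.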

Theorem 7. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2. Assume that $\|A_0\|_\infty \le a$ for some constant $a$ and that Assumption 4 holds. Consider the regularization parameter $\lambda$ satisfying (7). Then, there exists a numerical constant $c'$ such that

$$\frac{\|\hat A - A_0\|_2^2}{m_1 m_2} \le c' \max\left\{ \max(\sigma^2, a^2)\, \mu^2 L\, \frac{\log(d)\,\mathrm{rank}(A_0)\, M}{n},\ a^2 \mu \sqrt{\frac{\log(d)}{n}} \right\} \qquad (8)$$

with probability greater than $1 - 3/d$.

An important feature of our estimator is that its construction requires only an upper bound on the maximum absolute value of the entries of $A_0$ (and an upper bound on the variance of the noise). This condition is very mild. Let us compare it with the conditions used in previous work on noisy matrix completion. Except for [13], the previous estimators require prior knowledge of an upper bound on the standard deviation of the noise. In addition, [11] requires prior information on the rank of the unknown matrix as well as a matrix incoherence assumption (which is stated in terms of the singular vectors of $A_0$). In [20], prior information on the "spikiness ratio" $\alpha_{sp} = \sqrt{m_1 m_2}\,\|A_0\|_\infty / \|A_0\|_2$ of $A_0$ is needed. In [18, 14, 13], similarly to our construction, a prior bound on $\|A_0\|_\infty$ is required. An important difference is that, in these papers, the upper bound is used in the choice of the regularization parameter $\lambda$. This implies that the convex functional which is minimized in order to obtain $\hat A$ depends on $a$. Too large a bound may jeopardize the accuracy of the estimation. In our construction, $a$ determines the ball over which we minimize our convex functional, which itself is independent of $a$.

4 Matrix completion with unknown variance of 
the noise 

In this section we propose a new estimator for the matrix completion problem in the case when the variance $\sigma^2$ of the noise is unknown. Our construction is inspired by the square-root Lasso estimator proposed in [2]. We define the following estimator of $A_0$:

$$\hat A_{SQ} = \arg\min_{\|A\|_\infty \le a} \left\{ \sqrt{\frac{1}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, A\rangle\right)^2} + \lambda \|A\|_1 \right\}, \qquad (9)$$

where $\lambda > 0$ is a regularization parameter and $a$ is an upper bound on $\|A_0\|_\infty$. Note that the first term of this estimator is the square root of the data-dependent term of the estimator that we considered in Section 3. This is similar to the principle used to define the square-root Lasso estimator for the usual vector regression model.

Let us set $\rho = \dfrac{1}{16\,\mu\, m_1 m_2\, \mathrm{rank}(A_0)}$. The following theorem gives a general upper bound on the prediction error of the estimator $\hat A_{SQ}$. Its proof is given in Appendix D.
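The data-fitting term of (9) is the quantity $Q(A)$ from Section 2.2. A brief sketch (function names are hypothetical, and the constraint $\|A\|_\infty \le a$ is not enforced here) makes explicit one reason why the tuning parameter can be chosen without knowing $\sigma$: $Q$ is positively homogeneous, so rescaling the data and the candidate matrix by a common factor rescales both terms of the criterion by that factor.

```python
import numpy as np

def Q(A, rows, cols, Y):
    """Square-root loss Q(A) = sqrt((1/n) * sum_i (Y_i - <X_i, A>)^2)."""
    return float(np.sqrt(np.mean((Y - A[rows, cols]) ** 2)))

def sqrt_objective(A, rows, cols, Y, lam):
    """Criterion minimized in (9): Q(A) plus lam times the nuclear norm of A."""
    return Q(A, rows, cols, Y) + lam * float(np.linalg.norm(A, ord="nuc"))
```

For any `c > 0`, `sqrt_objective(c * A, rows, cols, c * Y, lam)` equals `c * sqrt_objective(A, rows, cols, Y, lam)`, whereas the squared loss of Section 3 scales like `c**2` in its first term and like `c` in its penalty.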

Theorem 8. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2, and let $\sqrt{\rho} \ge \lambda \ge 3\|\Sigma\|/Q(A_0)$. Then, there exist numerical constants $(c_1, c_2)$ such that

$$\frac{\|\hat A_{SQ} - A_0\|_2^2}{m_1 m_2} \le \max\left\{ c_1\, \mu^2\, m_1 m_2\, \mathrm{rank}(A_0) \left( Q^2(A_0)\lambda^2 + a^2 \left(\mathbb{E}(\|\Sigma_R\|)\right)^2 \right),\ c_2\, a^2 \mu \sqrt{\frac{\log(d)}{n}} \right\}$$

with probability at least $1 - \dfrac{2}{d}$.

In order to get a bound on the prediction risk in closed form we use the bounds on $\|\Sigma\|$ and $\mathbb{E}(\|\Sigma_R\|)$ given by Lemmas 5 and 6, taking $t = \log(d)$. It remains to bound $Q(A_0) = \sigma\sqrt{\dfrac{1}{n}\sum_{i=1}^n \xi_i^2}$. We consider the case of sub-Gaussian noise:

Assumption 9. There exists a constant $K$ such that

$$\mathbb{E}\left[\exp(t\xi_i)\right] \le \exp\left(t^2/2K\right)$$

for all $t > 0$.

Note that the condition $\mathbb{E}\xi_i^2 = 1$ implies that $K \le 1$. Under Assumption 9, $\xi_i^2$ are sub-exponential random variables. Then, the Bernstein inequality for sub-exponential random variables implies that there exists a numerical constant $c_3$ such that, with probability at least $1 - 2\exp\{-c_3 n\}$, one has

$$3\sigma/2 \ge Q(A_0) \ge \sigma/2. \qquad (10)$$

Using Lemma 5 and the right-hand side of (10), for $n \ge 2\log^2(d)\,m/L$, we can take

$$\lambda = 6C^*\sqrt{\frac{2L\log(d)}{mn}}. \qquad (11)$$

Note that $\lambda$ does not depend on $\sigma$ and satisfies the two conditions required in Theorem 8. We have that

$$\lambda \ge 3\|\Sigma\|/Q(A_0) \qquad (12)$$

with probability greater than $1 - 1/d - 2\exp\{-c_3 n\}$ and

$$\lambda \le \sqrt{\frac{1}{16\,\mu\, m_1 m_2\, \mathrm{rank}(A_0)}} \qquad (13)$$

for $n$ large enough, more precisely, for $n$ such that

$$n \ge c_4\, \mu L M\, \mathrm{rank}(A_0)\log(d) \qquad (14)$$

where $c_4 = 576\,(C^*)^2$. We obtain the following theorem.

Theorem 10. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2. Assume that $\|A_0\|_\infty \le a$ for some constant $a$ and that Assumption 9 holds. Consider the regularization parameter $\lambda$ satisfying (11) and $n$ satisfying (14). Then, there exist numerical constants $(c'', c_3)$ such that

$$\frac{\|\hat A_{SQ} - A_0\|_2^2}{m_1 m_2} \le c'' \max\left\{ \max(\sigma^2, a^2)\, \mu^2 L\, \frac{\log(d)\,\mathrm{rank}(A_0)\, M}{n},\ a^2 \mu \sqrt{\frac{\log(d)}{n}} \right\} \qquad (15)$$

with probability greater than $1 - 3/d - 2\exp\{-c_3 n\}$.



Note that condition (14) is not restrictive: indeed, the sample sizes $n$ satisfying condition (14) are of the same order of magnitude as those for which the normalized Frobenius error of our estimator is small. Thus, Theorem 10 shows that $\hat A_{SQ}$ has the same prediction performance as previously proposed estimators which rely on the knowledge of the standard deviation of the noise and of the sampling distribution.



5 Bounds on the stochastic errors 

In this section we obtain the upper bounds for the stochastic errors $\|\Sigma\|$ and $\mathbb{E}(\|\Sigma_R\|)$ defined in (4). In order to obtain such bounds we use the matrix version of Bernstein's inequality. The following proposition is obtained by an extension of Theorem 4 in [15] to rectangular matrices via self-adjoint dilation (cf., for example, 2.6 in [24]). Let $Z_1, \dots, Z_n$ be independent random matrices with dimensions $m_1 \times m_2$. Define

$$\sigma_Z = \max\left\{ \left\| \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left(Z_i Z_i^T\right) \right\|^{1/2},\ \left\| \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left(Z_i^T Z_i\right) \right\|^{1/2} \right\}$$

and

$$U_i = \inf\left\{ K > 0 : \mathbb{E}\exp\left(\|Z_i\|/K\right) \le e \right\}.$$

Proposition 11. Let $Z_1, \dots, Z_n$ be independent random matrices with dimensions $m_1 \times m_2$ that satisfy $\mathbb{E}(Z_i) = 0$. Suppose that $U_i < U$ for some constant $U$ and all $i = 1, \dots, n$. Then, there exists an absolute constant $c^*$ such that, for all $t > 0$, with probability at least $1 - e^{-t}$ we have

$$\left\| \frac{1}{n}\sum_{i=1}^n Z_i \right\| \le c^* \max\left\{ \sigma_Z \sqrt{\frac{t + \log(d)}{n}},\ U \log\left(\frac{U}{\sigma_Z}\right) \frac{t + \log(d)}{n} \right\},$$

where $d = m_1 + m_2$.

5.1 Proof of Lemma 5 

We apply Proposition 11 to $Z_i = \xi_i X_i$. We first estimate $\sigma_Z$ and $U$. Note that $Z_i$ is a zero-mean random matrix which satisfies

$$\|Z_i\| \le |\xi_i|.$$

Then, Assumption 4 implies that there exists a constant $K$ such that $U_i \le K$ for all $i = 1, \dots, n$. We compute

$$\mathbb{E}\left(Z_i Z_i^T\right) = R \quad \text{and} \quad \mathbb{E}\left(Z_i^T Z_i\right) = C,$$

where $C$ (resp. $R$) is the diagonal matrix with $C_k$ (resp. $R_j$) on the diagonal. This and the fact that the $X_i$ are i.i.d. imply that

$$\sigma_Z^2 = \max_{i,j}(C_i, R_j) \le L/m.$$

Note that $\max_{i,j}(C_i, R_j) \ge 1/m$, which implies that $\log(K/\sigma_Z) \le \log(Km)$, and the statement of Lemma 5 follows.
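The identities for the second moments used in this proof can be checked numerically for any sampling distribution. The sketch below is illustrative only (the helper name `second_moments` is hypothetical); it computes $\mathbb{E}(XX^T)$ and $\mathbb{E}(X^TX)$ exactly by summing over the support of $\Pi$. Under the additional assumption, made here for the illustration, that $\xi_i$ is independent of $X_i$ with $\mathbb{E}\xi_i^2 = 1$, these coincide with $\mathbb{E}(Z_iZ_i^T)$ and $\mathbb{E}(Z_i^TZ_i)$.

```python
import numpy as np

def second_moments(pi):
    """Return E[X X^T] and E[X^T X] for X = e_j e_k^T drawn with probability pi[j, k]."""
    m1, m2 = pi.shape
    EXXt = np.zeros((m1, m1))
    EXtX = np.zeros((m2, m2))
    for j in range(m1):
        for k in range(m2):
            X = np.zeros((m1, m2))
            X[j, k] = 1.0
            EXXt += pi[j, k] * (X @ X.T)   # X X^T = e_j e_j^T
            EXtX += pi[j, k] * (X.T @ X)   # X^T X = e_k e_k^T
    return EXXt, EXtX
```

The two outputs are the diagonal matrices $R$ and $C$, so the larger of their operator norms is $\max_{i,j}(C_i, R_j)$.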



5.2 Proof of Lemma 6 

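The proof follows the lines of the proof of Lemma 7 in [14]. For the sake of completeness we give it here. Set $t^* = \dfrac{Ln}{m\log^2(m)} - \log(d)$; $t^*$ is the value of $t$ such that the two terms in (6) are equal. Note that Lemma 5 implies that

$$\mathbb{P}\left( \left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| > t \right) \le d \exp\left\{ -t^2 nm / \left((C^*)^2 L\right) \right\} \quad \text{for } t \le t^* \qquad (16)$$

and

$$\mathbb{P}\left( \left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| > t \right) \le d \exp\left\{ -t\, n / \left(C^* \log(m)\right) \right\} \quad \text{for } t \ge t^*. \qquad (17)$$

We set $\nu_1 = nm/\left((C^*)^2 L\right)$, $\nu_2 = n/\left(C^*\log(m)\right)$. By Hölder's inequality we get

$$\mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| \le \left( \mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\|^{2\log(d)} \right)^{1/(2\log(d))}.$$

The inequalities (16) and (17) imply that

$$\left( \mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\|^{2\log(d)} \right)^{1/(2\log(d))} = \left( \int_0^{+\infty} \mathbb{P}\left( \left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| > t^{1/(2\log(d))} \right) dt \right)^{1/(2\log(d))}$$

$$\le \left( d\int_0^{+\infty} \exp\left\{ -t^{1/\log(d)}\nu_1 \right\} dt + d\int_0^{+\infty} \exp\left\{ -t^{1/(2\log(d))}\nu_2 \right\} dt \right)^{1/(2\log(d))}$$

$$\le \sqrt{e}\left( \log(d)\,\nu_1^{-\log(d)}\,\Gamma(\log(d)) + 2\log(d)\,\nu_2^{-2\log(d)}\,\Gamma(2\log(d)) \right)^{1/(2\log(d))}. \qquad (18)$$

The Gamma-function satisfies the following bound:

$$\text{for } x \ge 2, \quad \Gamma(x) \le \left(\frac{x}{2}\right)^{x-1} \qquad (19)$$

(see e.g. [14]). Plugging this into (18) we compute

$$\mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| \le \sqrt{e}\left( (\log(d))^{\log(d)}\,\nu_1^{-\log(d)}\,2^{1-\log(d)} + 2(\log(d))^{2\log(d)}\,\nu_2^{-2\log(d)} \right)^{1/(2\log(d))}.$$

Observe that $n > n^*$ implies $\nu_1\log(d) \le \nu_2^2$ and we obtain

$$\mathbb{E}\left\| \frac{1}{n}\sum_{i=1}^n \xi_i X_i \right\| \le \sqrt{\frac{2e\log(d)}{\nu_1}}. \qquad (20)$$

We conclude the proof by plugging $\nu_1 = nm/\left((C^*)^2 L\right)$ into (20).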

A Proof of Theorem 3 

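It follows from the definition of the estimator $\hat A$ that

$$\frac{1}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, \hat A\rangle\right)^2 + \lambda\|\hat A\|_1 \le \frac{1}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, A_0\rangle\right)^2 + \lambda\|A_0\|_1$$

which, using (1), implies

$$\frac{1}{n}\sum_{i=1}^n \left(\langle X_i, A_0 - \hat A\rangle + \sigma\xi_i\right)^2 + \lambda\|\hat A\|_1 \le \frac{\sigma^2}{n}\sum_{i=1}^n \xi_i^2 + \lambda\|A_0\|_1.$$

Hence,

$$\frac{1}{n}\sum_{i=1}^n \langle X_i, A_0 - \hat A\rangle^2 + 2\langle \Sigma, A_0 - \hat A\rangle + \lambda\|\hat A\|_1 \le \lambda\|A_0\|_1$$

where $\Sigma = \dfrac{\sigma}{n}\sum_{i=1}^n \xi_i X_i$. Then, by the duality between the nuclear and the operator norms, we obtain

$$\frac{1}{n}\left\|\mathcal{O}(A_0 - \hat A)\right\|_2^2 + \lambda\|\hat A\|_1 \le 2\|\Sigma\|\,\|A_0 - \hat A\|_1 + \lambda\|A_0\|_1. \qquad (21)$$

Note that from (38) we get

$$\|A_0\|_1 - \|\hat A\|_1 \le \left\|P_{A_0}(A_0 - \hat A)\right\|_1 - \left\|P_{A_0}^\perp(A_0 - \hat A)\right\|_1. \qquad (22)$$

This, the triangle inequality and $\lambda \ge 3\|\Sigma\|$ lead to

$$\frac{1}{n}\left\|\mathcal{O}(A_0 - \hat A)\right\|_2^2 \le 2\|\Sigma\|\left\|P_{A_0}(A_0 - \hat A)\right\|_1 + \lambda\left\|P_{A_0}(A_0 - \hat A)\right\|_1 \le \frac{5}{3}\lambda\left\|P_{A_0}(A_0 - \hat A)\right\|_1. \qquad (23)$$

Since $P_A(B) = P_{S_1^\perp(A)}\,B\,P_{S_2(A)} + P_{S_1(A)}B$ and $\mathrm{rank}(P_{S_i(A)}B) \le \mathrm{rank}(A)$ we have that $\mathrm{rank}(P_A(B)) \le 2\,\mathrm{rank}(A)$. From (23) we compute

$$\frac{1}{n}\left\|\mathcal{O}(A_0 - \hat A)\right\|_2^2 \le \frac{5}{3}\lambda\sqrt{2\,\mathrm{rank}(A_0)}\,\|\hat A - A_0\|_2. \qquad (24)$$

For $0 < r \le m$ we consider the following constraint set

$$\mathcal{C}(r) = \left\{ A \in \mathbb{R}^{m_1 \times m_2} : \|A\|_\infty = 1,\ \|A\|_{L_2(\Pi)}^2 \ge \sqrt{\frac{64\log(d)}{\log(6/5)\,n}},\ \|A\|_1 \le \sqrt{r}\,\|A\|_2 \right\}. \qquad (25)$$

Note that the condition $\|A\|_1 \le \sqrt{r}\,\|A\|_2$ is satisfied if $\mathrm{rank}(A) \le r$.

The following lemma shows that for matrices $A \in \mathcal{C}(r)$ the observation operator $\mathcal{O}$ satisfies an approximate restricted isometry property. Its proof is given in Appendix B.

Lemma 12. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2. Then, for all $A \in \mathcal{C}(r)$,

$$\frac{1}{n}\|\mathcal{O}(A)\|_2^2 \ge \frac{1}{2}\|A\|_{L_2(\Pi)}^2 - 44\,\mu\, r\, m_1 m_2 \left(\mathbb{E}(\|\Sigma_R\|)\right)^2$$

with probability at least $1 - \dfrac{2}{d}$.

We need the following auxiliary lemma, which is proven in Appendix E.

Lemma 13. If $\lambda \ge 3\|\Sigma\|$, then

$$\left\|P_{A_0}^\perp(\hat A - A_0)\right\|_1 \le 5\left\|P_{A_0}(\hat A - A_0)\right\|_1.$$

Lemma 13 implies that

$$\|\hat A - A_0\|_1 \le 6\left\|P_{A_0}(\hat A - A_0)\right\|_1 \le \sqrt{72\,\mathrm{rank}(A_0)}\,\|\hat A - A_0\|_2.$$

Set $\tilde a = \|\hat A - A_0\|_\infty$. By definition of $\hat A$ we have that $\tilde a \le 2a$. We now consider two cases, depending on whether the matrix $\dfrac{1}{\tilde a}(\hat A - A_0)$ belongs to the set $\mathcal{C}(72\,\mathrm{rank}(A_0))$ or not.

Case 1: Suppose first that $\|\hat A - A_0\|_{L_2(\Pi)}^2 < \tilde a^2\sqrt{\dfrac{64\log(d)}{\log(6/5)\,n}}$. Then (3) implies that

$$\frac{\|\hat A - A_0\|_2^2}{m_1 m_2} \le 4a^2\mu\sqrt{\frac{64\log(d)}{\log(6/5)\,n}} \qquad (26)$$

and we get the statement of Theorem 3 in this case.

Case 2: It remains to consider the case $\|\hat A - A_0\|_{L_2(\Pi)}^2 \ge \tilde a^2\sqrt{\dfrac{64\log(d)}{\log(6/5)\,n}}$. Lemma 13 implies that $\dfrac{1}{\tilde a}(\hat A - A_0) \in \mathcal{C}(72\,\mathrm{rank}(A_0))$ and we can apply Lemma 12. From Lemma 12 and (24) we obtain that with probability at least $1 - \dfrac{2}{d}$ one has

$$\frac{1}{2}\|\hat A - A_0\|_{L_2(\Pi)}^2 \le \frac{5}{3}\lambda\sqrt{2\,\mathrm{rank}(A_0)}\,\|\hat A - A_0\|_2 + 3168\,\mu\,\tilde a^2\,\mathrm{rank}(A_0)\,m_1 m_2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2$$

$$\le 6\lambda^2\mu\, m_1 m_2\,\mathrm{rank}(A_0) + \frac{1}{4}(m_1 m_2\,\mu)^{-1}\|\hat A - A_0\|_2^2 + 3168\,\mu\,\tilde a^2\,\mathrm{rank}(A_0)\,m_1 m_2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2.$$

Now (3) and $\tilde a \le 2a$ imply that there exists a numerical constant $c_1$ such that

$$\|\hat A - A_0\|_2^2 \le c_1\,(\mu\, m_1 m_2)^2\,\mathrm{rank}(A_0)\left(\lambda^2 + a^2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2\right)$$

which, together with (26), leads to the statement of Theorem 3.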



B Proof of Lemma 12 

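The main lines of this proof are close to those of the proof of Theorem 1 in [20]. Set $\mathcal{E} = 44\,\mu\, r\, m_1 m_2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2$. We will show that the probability of the following "bad" event is small:

$$\mathcal{B} = \left\{ \exists A \in \mathcal{C}(r) \text{ such that } \left| \frac{1}{n}\|\mathcal{O}(A)\|_2^2 - \|A\|_{L_2(\Pi)}^2 \right| > \frac{1}{2}\|A\|_{L_2(\Pi)}^2 + \mathcal{E} \right\}.$$

Note that $\mathcal{B}$ contains the complement of the event that we are interested in.

In order to estimate the probability of $\mathcal{B}$ we use a standard peeling argument. Let $\nu = \sqrt{\dfrac{64\log(d)}{\log(6/5)\,n}}$ and $\alpha = \dfrac{6}{5}$. For $l \in \mathbb{N}$ set

$$S_l = \left\{ A \in \mathcal{C}(r) : \alpha^{l-1}\nu \le \|A\|_{L_2(\Pi)}^2 \le \alpha^l\nu \right\}.$$

If the event $\mathcal{B}$ holds for some matrix $A \in \mathcal{C}(r)$, then $A$ belongs to some $S_l$ and

$$\left| \frac{1}{n}\|\mathcal{O}(A)\|_2^2 - \|A\|_{L_2(\Pi)}^2 \right| > \frac{1}{2}\|A\|_{L_2(\Pi)}^2 + \mathcal{E} \ge \frac{1}{2}\alpha^{l-1}\nu + \mathcal{E} = \frac{5}{12}\alpha^l\nu + \mathcal{E}. \qquad (27)$$

For each $T > \nu$ consider the following set of matrices

$$\mathcal{C}(r, T) = \left\{ A \in \mathcal{C}(r) : \|A\|_{L_2(\Pi)}^2 \le T \right\}$$

and the following event

$$\mathcal{B}_l = \left\{ \exists A \in \mathcal{C}(r, \alpha^l\nu) : \left| \frac{1}{n}\|\mathcal{O}(A)\|_2^2 - \|A\|_{L_2(\Pi)}^2 \right| > \frac{5}{12}\alpha^l\nu + \mathcal{E} \right\}.$$

Note that $A \in S_l$ implies that $A \in \mathcal{C}(r, \alpha^l\nu)$. Then (27) implies that $\mathcal{B}_l$ holds and we get $\mathcal{B} \subset \cup_l \mathcal{B}_l$. Thus, it is enough to estimate the probability of the simpler event $\mathcal{B}_l$ and then apply the union bound. Such an estimate is given by the following lemma. Its proof is given in Appendix C. Let

$$Z_T = \sup_{A \in \mathcal{C}(r,T)} \left| \frac{1}{n}\|\mathcal{O}(A)\|_2^2 - \|A\|_{L_2(\Pi)}^2 \right|.$$

Lemma 14. Let $X_i$ be i.i.d. with distribution $\Pi$ on $\mathcal{X}$ which satisfies Assumptions 1 and 2. Then,

$$\mathbb{P}\left( Z_T > \frac{5}{12}T + 44\,\mu\, r\, m_1 m_2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2 \right) \le \exp\left(-c_5\, n\, T^2\right)$$

where $c_5 = \dfrac{1}{128}$.

Lemma 14 implies that $\mathbb{P}(\mathcal{B}_l) \le \exp\left(-c_5\, n\,\alpha^{2l}\nu^2\right)$. Using the union bound we obtain

$$\mathbb{P}(\mathcal{B}) \le \sum_{l=1}^\infty \mathbb{P}(\mathcal{B}_l) \le \sum_{l=1}^\infty \exp\left(-c_5\, n\,\alpha^{2l}\nu^2\right) \le \sum_{l=1}^\infty \exp\left(-\left(2\,c_5\, n\log(\alpha)\,\nu^2\right) l\right)$$

where we used $e^x \ge x$. We finally compute, for $\nu = \sqrt{\dfrac{64\log(d)}{\log(6/5)\,n}}$,

$$\mathbb{P}(\mathcal{B}) \le \frac{\exp\left(-2\,c_5\, n\log(\alpha)\,\nu^2\right)}{1 - \exp\left(-2\,c_5\, n\log(\alpha)\,\nu^2\right)} = \frac{\exp(-\log(d))}{1 - \exp(-\log(d))}.$$

This completes the proof of Lemma 12.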



C Proof of Lemma 14 

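Our approach is standard: first we show that $Z_T$ concentrates around its expectation and then we upper bound the expectation.

By definition, $Z_T = \sup_{A \in \mathcal{C}(r,T)} \left| \dfrac{1}{n}\sum_{i=1}^n \left( \langle X_i, A\rangle^2 - \mathbb{E}\langle X_i, A\rangle^2 \right) \right|$. Massart's concentration inequality (see e.g. [3, Theorem 14.2]) implies that

$$\mathbb{P}\left( Z_T \ge \mathbb{E}(Z_T) + \frac{1}{9}\left(\frac{5}{12}T\right) \right) \le \exp\left(-c_5\, n\, T^2\right) \qquad (28)$$

where $c_5 = \dfrac{1}{128}$. Next we bound the expectation $\mathbb{E}(Z_T)$. Using a standard symmetrization argument (see e.g. [16, Theorem 2.1]) we obtain

$$\mathbb{E}(Z_T) = \mathbb{E}\left( \sup_{A \in \mathcal{C}(r,T)} \left| \frac{1}{n}\sum_{i=1}^n \left( \langle X_i, A\rangle^2 - \mathbb{E}\langle X_i, A\rangle^2 \right) \right| \right) \le 2\,\mathbb{E}\left( \sup_{A \in \mathcal{C}(r,T)} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i \langle X_i, A\rangle^2 \right| \right)$$

where $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. Rademacher sequence. The assumption $\|A\|_\infty = 1$ implies $|\langle X_i, A\rangle| \le 1$. Then, the contraction inequality (see e.g. [16]) yields

$$\mathbb{E}(Z_T) \le 8\,\mathbb{E}\left( \sup_{A \in \mathcal{C}(r,T)} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i \langle X_i, A\rangle \right| \right) = 8\,\mathbb{E}\left( \sup_{A \in \mathcal{C}(r,T)} \left| \langle \Sigma_R, A\rangle \right| \right)$$

where $\Sigma_R = \dfrac{1}{n}\sum_{i=1}^n \epsilon_i X_i$. For $A \in \mathcal{C}(r, T)$ we have that

$$\|A\|_1 \le \sqrt{r}\,\|A\|_2 \le \sqrt{\mu\, r\, m_1 m_2}\,\|A\|_{L_2(\Pi)} \le \sqrt{\mu\, m_1 m_2\, r\, T}$$

where we have used (3). Then, by the duality between nuclear and operator norms, we compute

$$\mathbb{E}(Z_T) \le 8\,\mathbb{E}\left( \sup_{\|A\|_1 \le \sqrt{\mu\, m_1 m_2\, r\, T}} \left| \langle \Sigma_R, A\rangle \right| \right) \le 8\sqrt{\mu\, m_1 m_2\, r\, T}\ \mathbb{E}(\|\Sigma_R\|).$$

Finally, using

$$\frac{1}{9}\left(\frac{5}{12}T\right) + 8\sqrt{\mu\, m_1 m_2\, r\, T}\ \mathbb{E}(\|\Sigma_R\|) \le \left(\frac{1}{9} + \frac{8}{9}\right)\frac{5}{12}T + 44\,\mu\, r\, m_1 m_2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2$$

and the concentration bound (28) we obtain that

$$\mathbb{P}\left( Z_T > \frac{5}{12}T + 44\,\mu\, r\, m_1 m_2\left(\mathbb{E}(\|\Sigma_R\|)\right)^2 \right) \le \exp\left(-c_5\, n\, T^2\right)$$

with $c_5 = \dfrac{1}{128}$ as stated.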
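D Proof of Theorem 8

Let us set $\Delta = A_0 - \hat A_{SQ}$. We have that

$$Q^2(\hat A_{SQ}) - Q^2(A_0) = \frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 + 2\left\langle \frac{\sigma}{n}\sum_{i=1}^n \xi_i X_i, \Delta \right\rangle = \frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 + 2\langle \Sigma, \Delta\rangle$$

where $\Sigma = \dfrac{\sigma}{n}\sum_{i=1}^n \xi_i X_i$. This implies

$$\frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 = -2\langle \Sigma, \Delta\rangle + \left(Q(\hat A_{SQ}) - Q(A_0)\right)\left(Q(\hat A_{SQ}) + Q(A_0)\right). \qquad (29)$$

We need the following auxiliary lemma, which is proven in Appendix F.

Lemma 15. If $\lambda \ge 3\|\Sigma\|/Q(A_0)$, then

$$\left\|P_{A_0}^\perp(\Delta)\right\|_1 \le 2\left\|P_{A_0}(\Delta)\right\|_1$$

where $\Delta = \hat A_{SQ} - A_0$.

Note that from (38) we get

$$\|A_0\|_1 - \|\hat A_{SQ}\|_1 \le \left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1. \qquad (30)$$

The definition of $\hat A_{SQ}$ and (30) imply that

$$Q(A_0) + Q(\hat A_{SQ}) \le 2Q(A_0) + \lambda\left(\|A_0\|_1 - \|\hat A_{SQ}\|_1\right) \le 2Q(A_0) + \lambda\left(\left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1\right) \qquad (31)$$

and

$$Q(\hat A_{SQ}) - Q(A_0) \le \lambda\left(\|A_0\|_1 - \|\hat A_{SQ}\|_1\right) \le \lambda\left(\left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1\right) \le \lambda\left(2\left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1\right). \qquad (32)$$

Lemma 15 implies that $2\left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1 \ge 0$. From (31) and (32) we compute

$$\left(Q(\hat A_{SQ}) - Q(A_0)\right)\left(Q(\hat A_{SQ}) + Q(A_0)\right) \le \lambda\left(2\left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1\right)\left(2Q(A_0) + \lambda\left(\left\|P_{A_0}(\Delta)\right\|_1 - \left\|P_{A_0}^\perp(\Delta)\right\|_1\right)\right)$$

$$= 4\lambda Q(A_0)\left\|P_{A_0}(\Delta)\right\|_1 - 2\lambda Q(A_0)\left\|P_{A_0}^\perp(\Delta)\right\|_1 + 2\lambda^2\left\|P_{A_0}(\Delta)\right\|_1^2 + \lambda^2\left\|P_{A_0}^\perp(\Delta)\right\|_1^2 - 3\lambda^2\left\|P_{A_0}(\Delta)\right\|_1\left\|P_{A_0}^\perp(\Delta)\right\|_1. \qquad (33)$$

Lemma 15 implies that $\lambda^2\left\|P_{A_0}^\perp(\Delta)\right\|_1^2 - 3\lambda^2\left\|P_{A_0}(\Delta)\right\|_1\left\|P_{A_0}^\perp(\Delta)\right\|_1 \le 0$ and we obtain from (33)

$$\left(Q(\hat A_{SQ}) - Q(A_0)\right)\left(Q(\hat A_{SQ}) + Q(A_0)\right) \le 4\lambda Q(A_0)\left\|P_{A_0}(\Delta)\right\|_1 - 2\lambda Q(A_0)\left\|P_{A_0}^\perp(\Delta)\right\|_1 + 2\lambda^2\left\|P_{A_0}(\Delta)\right\|_1^2. \qquad (34)$$

Plugging (34) into (29) we get

$$\frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 \le -2\langle \Sigma, \Delta\rangle + 4\lambda Q(A_0)\left\|P_{A_0}(\Delta)\right\|_1 - 2\lambda Q(A_0)\left\|P_{A_0}^\perp(\Delta)\right\|_1 + 2\lambda^2\left\|P_{A_0}(\Delta)\right\|_1^2.$$

Then, by the duality between the nuclear and the operator norms, we obtain

$$\frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 \le 2\|\Sigma\|\left\|P_{A_0}(\Delta)\right\|_1 + 2\|\Sigma\|\left\|P_{A_0}^\perp(\Delta)\right\|_1 + 4\lambda Q(A_0)\left\|P_{A_0}(\Delta)\right\|_1 - 2\lambda Q(A_0)\left\|P_{A_0}^\perp(\Delta)\right\|_1 + 2\lambda^2\left\|P_{A_0}(\Delta)\right\|_1^2.$$

Using $\lambda Q(A_0) \ge 3\|\Sigma\|$ we compute

$$\frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 \le \frac{14}{3}\lambda Q(A_0)\left\|P_{A_0}(\Delta)\right\|_1 + 2\lambda^2\left\|P_{A_0}(\Delta)\right\|_1^2$$

which leads to

$$\frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 \le \frac{14}{3}\lambda Q(A_0)\sqrt{2\,\mathrm{rank}(A_0)}\,\|\Delta\|_2 + 4\lambda^2\,\mathrm{rank}(A_0)\,\|\Delta\|_2^2.$$

The condition $4\mu\, m_1 m_2\,\lambda^2\,\mathrm{rank}(A_0) \le 1/4$ implies that

$$\frac{1}{n}\|\mathcal{O}(\Delta)\|_2^2 \le \frac{14}{3}\lambda Q(A_0)\sqrt{2\,\mathrm{rank}(A_0)}\,\|\Delta\|_2 + \frac{\|\Delta\|_2^2}{4\mu\, m_1 m_2}. \qquad (35)$$

Set $\tilde a = \|\hat A_{SQ} - A_0\|_\infty$. By the definition of $\hat A_{SQ}$ we have that $\tilde a \le 2a$. We now consider two cases, depending on whether the matrix $\dfrac{1}{\tilde a}(\hat A_{SQ} - A_0)$ belongs or not to the set $\mathcal{C}(18\,\mathrm{rank}(A_0))$.

Case 1: Suppose first that $\|\hat A_{SQ} - A_0\|_{L_2(\Pi)}^2 < \tilde a^2\sqrt{\dfrac{64\log(d)}{\log(6/5)\,n}}$. Then (3) implies that

$$\frac{\|\hat A_{SQ} - A_0\|_2^2}{m_1 m_2} \le 4a^2\mu\sqrt{\frac{64\log(d)}{\log(6/5)\,n}} \qquad (36)$$

and we get the statement of Theorem 8 in this case.

Case 2: It remains to consider the case $\|\hat A_{SQ} - A_0\|_{L_2(\Pi)}^2 \ge \tilde a^2\sqrt{\dfrac{64\log(d)}{\log(6/5)\,n}}$. Lemma 15 implies that $\dfrac{1}{\tilde a}(\hat A_{SQ} - A_0) \in \mathcal{C}(18\,\mathrm{rank}(A_0))$ and we can apply Lemma 12. From Lemma 12, (3) and (35) we obtain that, with probability at least $1 - \dfrac{2}{d}$, one has

$$\frac{\|\Delta\|_2^2}{2\mu\, m_1 m_2} \le \frac{14}{3}\lambda Q(A_0)\sqrt{2\,\mathrm{rank}(A_0)}\,\|\Delta\|_2 + \frac{\|\Delta\|_2^2}{4\mu\, m_1 m_2} + 792\,\tilde a^2\mu\, m_1 m_2\,\mathrm{rank}(A_0)\left(\mathbb{E}(\|\Sigma_R\|)\right)^2.$$

A simple calculation yields

$$\left( \frac{\|\Delta\|_2}{2\sqrt{\mu\, m_1 m_2}} - \frac{14}{3}\lambda Q(A_0)\sqrt{2\,\mathrm{rank}(A_0)\,\mu\, m_1 m_2} \right)^2 \le \left( \frac{14}{3}\lambda Q(A_0)\sqrt{2\,\mathrm{rank}(A_0)\,\mu\, m_1 m_2} \right)^2 + 792\,\tilde a^2\mu\, m_1 m_2\,\mathrm{rank}(A_0)\left(\mathbb{E}(\|\Sigma_R\|)\right)^2$$

and

$$\frac{\|\Delta\|_2}{2\sqrt{\mu\, m_1 m_2}} \le \frac{28}{3}\lambda Q(A_0)\sqrt{2\,\mathrm{rank}(A_0)\,\mu\, m_1 m_2} + \sqrt{792}\,\tilde a\sqrt{\mu\, m_1 m_2\,\mathrm{rank}(A_0)}\ \mathbb{E}(\|\Sigma_R\|). \qquad (37)$$

This and $\tilde a \le 2a$ imply that there exists a numerical constant $c_1'$ such that

$$\frac{\|\hat A_{SQ} - A_0\|_2^2}{m_1 m_2} \le c_1'\,\mu^2\, m_1 m_2\left( Q^2(A_0)\,\lambda^2\,\mathrm{rank}(A_0) + a^2\,\mathrm{rank}(A_0)\left(\mathbb{E}(\|\Sigma_R\|)\right)^2 \right),$$

which, together with (36), leads to the statement of Theorem 8.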



E Proof of Lemma 13 

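By definition of $P_{A_0}^\perp$, for any matrix $B$ the singular vectors of $P_{A_0}^\perp(B)$ are orthogonal to the space spanned by the singular vectors of $A_0$. This implies that $\left\|A_0 + P_{A_0}^\perp(\hat A - A_0)\right\|_1 = \|A_0\|_1 + \left\|P_{A_0}^\perp(\hat A - A_0)\right\|_1$. Then

$$\|\hat A\|_1 = \left\|A_0 + \hat A - A_0\right\|_1 = \left\|A_0 + P_{A_0}^\perp(\hat A - A_0) + P_{A_0}(\hat A - A_0)\right\|_1$$

$$\ge \left\|A_0 + P_{A_0}^\perp(\hat A - A_0)\right\|_1 - \left\|P_{A_0}(\hat A - A_0)\right\|_1 = \|A_0\|_1 + \left\|P_{A_0}^\perp(\hat A - A_0)\right\|_1 - \left\|P_{A_0}(\hat A - A_0)\right\|_1. \qquad (38)$$

By the convexity of $Q^2(A)$ and using $\lambda \ge 3\|\Sigma\|$ we have

$$Q^2(\hat A) - Q^2(A_0) \ge -\frac{2}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, A_0\rangle\right)\langle X_i, \hat A - A_0\rangle = -2\langle \Sigma, \hat A - A_0\rangle \ge -2\|\Sigma\|\,\|\hat A - A_0\|_1 \ge -\frac{2}{3}\lambda\,\|\hat A - A_0\|_1.$$

Using the definition of $\hat A$ we compute

$$\lambda\|\hat A\|_1 - \lambda\|A_0\|_1 \le Q^2(A_0) - Q^2(\hat A) \le \frac{2}{3}\lambda\,\|\hat A - A_0\|_1.$$

Combined with (38) and the triangle inequality, this implies that

$$\left\|P_{A_0}^\perp(\hat A - A_0)\right\|_1 \le 5\left\|P_{A_0}(\hat A - A_0)\right\|_1$$

as stated.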

F Proof of Lemma 15 



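By the convexity of $Q(A)$ we have

$$Q(\hat A_{SQ}) - Q(A_0) \ge -\frac{\frac{1}{n}\sum_{i=1}^n \left(Y_i - \langle X_i, A_0\rangle\right)\langle X_i, \hat A_{SQ} - A_0\rangle}{Q(A_0)} = -\frac{\langle \Sigma, \hat A_{SQ} - A_0\rangle}{Q(A_0)} \ge -\frac{\|\Sigma\|}{Q(A_0)}\,\|\hat A_{SQ} - A_0\|_1 \ge -\frac{\lambda}{3}\,\|\hat A_{SQ} - A_0\|_1.$$

Using the definition of $\hat A_{SQ}$, we compute

$$\lambda\|\hat A_{SQ}\|_1 - \lambda\|A_0\|_1 \le Q(A_0) - Q(\hat A_{SQ}) \le \frac{\lambda}{3}\,\|\hat A_{SQ} - A_0\|_1.$$

Then (38) and the triangle inequality imply

$$\left\|P_{A_0}^\perp(\hat A_{SQ} - A_0)\right\|_1 - \left\|P_{A_0}(\hat A_{SQ} - A_0)\right\|_1 \le \frac{1}{3}\left( \left\|P_{A_0}^\perp(\hat A_{SQ} - A_0)\right\|_1 + \left\|P_{A_0}(\hat A_{SQ} - A_0)\right\|_1 \right)$$

and the statement of Lemma 15 follows.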